BBC RD 1982/21 


RESEARCH DEPARTMENT REPORT 



BROADCAST TELEPHONE SPEECH: 

use of speech synthesis to 

extend the audio bandwidth 



P.S. Gaskell, M.A., C.Eng, M.I.E.E., W.I. Manson, B.Sc.(Eng.),
T.A. Moore, B.E., M.Eng.Sc., M.I.E.E. 



Research Department, Engineering Division 

THE BRITISH BROADCASTING CORPORATION November, 1982



UDC 534.781
621.395.73



Summary 

Speech signals received over the public telephone network are increasingly 
being used in broadcasting, particularly for contributions from correspondents 
and commentators and in the "phone-in" type of programme. Though telephone
speech quality is normally adequate for communicating information, it can compare
very unfavourably with studio speech in the same broadcast programme. 

This Report considers the possibility of making broadcast telephone speech 
more acceptable by using speech synthesis to extend the bandwidth beyond the 
300 Hz to 3.4 kHz limits commonly set by the telephone system. 

Tests have indicated that if the missing components below 300 Hz and 
above 3.4 kHz could be synthesised ideally a marked improvement would be 
achieved. Further tests showed that a practical synthesis of low-frequency 
components based on pitch information extracted from simultaneously recorded 
wide-band speech gave a smaller but distinct improvement. 

However, low-frequency synthesis based on pitch information extracted 
from the available telephone signal was markedly less acceptable - often sounding 
rough, and failing to blend with the telephone signal. This result is attributed to 
multiple, missing, or irregularly-spaced pitch pulses. 

It is concluded, therefore, that telephone signal bandwidth extension
would be worthwhile if it could be achieved satisfactorily, but that a very precise 
pitch extractor is required for the low-frequency speech synthesis process. 



Issued under the authority of 



Research Department, Engineering Division, 

BRITISH BROADCASTING CORPORATION Head of Research Department 

November, 1982 

(EL-161) 



This Report may not be reproduced in any
form without the written permission of the 
British Broadcasting Corporation. 

It uses SI units in accordance with B.S. 
document PD 5686. 




Section   Title

          Summary
1.        Introduction
2.        Simulation of ideal synthesis
2.1.      General
2.2.      Subjective evaluation of the simulation of ideal synthesis
3.        Practical synthesis of missing speech components
3.1.      General
3.2.      Low-frequency synthesis based on pitch-frequency estimation
          from simultaneous wide-band speech signal
3.3.      Subjective evaluation of practical synthesis
4.        Pitch extraction
4.1.      Theory
4.2.      Software simulation
4.2.1.    General
4.2.2.    PDP11 simulation
4.2.3.    COPAS realisation
4.3.      Synthesis of low frequencies from the pitch pulses extracted
          from telephone speech
4.4.      Results and discussion
5.        Conclusions
6.        References
          Appendix: Calculation of predictor coefficients






© BBC 2006. All rights reserved. Except as provided below, no part of this document may be 
reproduced in any material form (including photocopying or storing it in any medium by electronic 
means) without the prior written permission of BBC Research & Development except in accordance 
with the provisions of the (UK) Copyright, Designs and Patents Act 1988. 

The BBC grants permission to individuals and organisations to make copies of the entire document 
(including this copyright notice) for their own internal use. No copies of this document may be 
published, distributed or made available to third parties whether by paper, electronic or other means 
without the BBC's prior written permission. Where necessary, third parties should be directed to the 
relevant page on BBC's website at http://www.bbc.co.uk/rd/pubs/ for a copy of this document. 






1. Introduction 

The public telephone service is normally 
highly successful in serving its primary purpose - 
namely the provision of an efficient 
speech-communication network on a global scale. 
However, speech contributions received by 
telephone are now being included increasingly in 
broadcast Radio and Television programmes 
and often compare unfavourably with studio 
speech in the same programme; broadcast 
telephone speech may sound unnatural and can be 
a very significant source of irritation to the listener. 
The investigation described in this Report was 
directed primarily at making telephone speech 
sound more natural, and therefore less likely 
to annoy the listener, rather than to improve 
intelligibility. 

Broadcast telephone speech falls essentially 
into one of two categories, depending on whether 
or not the broadcaster has access to the sending 
terminal. In one category - including sports 
commentaries and contributions from 
correspondents, for example - the broadcaster may 
be able to take action at the sending terminal to 
improve the quality of the speech as received, say 
by using a microphone of higher quality than the 
carbon insert commonly used in telephone handsets.
The second category comprises largely
contributions from the public to the now popular
"phone-in" type of programme. Here the broadcaster
has no access to the sending terminal and
quality-improving techniques can be introduced
only at the receiving terminal.

The work described in this Report was 
concerned with action which might be taken at 
the receiving terminal to improve the quality of 
telephone speech for broadcasting, primarily 
with the "phone-in" type of programme in mind - 
but any technique found effective could be applied
equally to telephoned contributions from commentators
and correspondents.

The principal factors which can detract
from the quality of telephone speech for broadcasting
have been described by M.G. Croll¹. These
include effects arising largely in the carbon microphone
insert so commonly used - irregular
amplitude/frequency response characteristics,
non-linear distortion and noise, for example -
and effects such as interference and bandwidth
restriction associated predominantly with the
transmission medium.

In the earlier work¹ various techniques were
considered for improving the technical quality of
received telephone speech for broadcasting. Unfortunately,
it had to be concluded that even the most
successful combination of these techniques
achieved only a slight improvement of quality
with the telephone circuits used in the tests.

The present work has been concerned 
specifically with efforts to reduce degradation due 
to only one of the telephone system "defects" 
noted above - namely the restriction of audio 
bandwidth to some 300 Hz to 3.4 kHz. Attempts
were made during the earlier work¹ to extend the
effective bandwidth by synthesis of missing low-frequency
components using a technique of full-wave
rectification of the received signal followed
by appropriate filtering to obtain components
in the 50 Hz to 300 Hz band. The results were
found not to be consistently beneficial.
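
The principle behind the rectification technique can be illustrated with a minimal sketch. It assumes an idealised "telephone" signal made of just two harmonics (400 Hz and 500 Hz) of a 100 Hz pitch whose fundamental has been filtered out; the sampling rate, the test signal and the helper name `component_amplitude` are illustrative assumptions, not details of the earlier work. Real speech is far less regular, which may be one reason the results were inconsistent.

```python
import math

FS = 8000  # assumed sampling rate in Hz (illustrative)

def component_amplitude(signal, freq_hz, fs_hz=FS):
    """Amplitude of the sinusoidal component at freq_hz (single-bin DFT)."""
    n = len(signal)
    re = sum(x * math.cos(2 * math.pi * freq_hz * i / fs_hz) for i, x in enumerate(signal))
    im = sum(x * math.sin(2 * math.pi * freq_hz * i / fs_hz) for i, x in enumerate(signal))
    return 2.0 * math.hypot(re, im) / n

# One second of a band-limited signal: 4th and 5th harmonics of a 100 Hz pitch.
band_limited = [
    math.sin(2 * math.pi * 400 * i / FS) + math.sin(2 * math.pi * 500 * i / FS)
    for i in range(FS)
]

# Full-wave rectification is a non-linear operation, so it creates new
# components, including one at the missing 100 Hz fundamental.
rectified = [abs(x) for x in band_limited]

before = component_amplitude(band_limited, 100)
after = component_amplitude(rectified, 100)
print(f"100 Hz component before rectification: {before:.6f}")
print(f"100 Hz component after rectification:  {after:.3f}")
```

A low-pass or band-pass filter following the rectifier would then select the regenerated components below 300 Hz.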

For the present investigation it was decided 
to attempt to generate a synthetic larynx 
waveform based on an estimation of the "pitch" 
of the speaker's voice, using linear prediction 
techniques. However, it was recognised that even
the concept of improving the quality of received 
telephone speech by adding synthesised low- and 
high-frequency components was, to a degree, 
speculative, and a preliminary study was therefore 
carried out to evaluate the potential of 'ideal 
synthesis'. 

2. Simulation of ideal synthesis 

2.1. General 

As noted in the previous Section, it was 
accepted that the proposed investigation was 
somewhat speculative. It was argued, for example, 
that the quality of the 300 Hz to 3.4 kHz "core" 
of the speech signal, provided by the telephone
channel, might so determine the overall quality 
that efforts to make telephone speech more 
acceptable for broadcasting by extending the 
bandwidth might be futile, even if perfect synthesis 
could be achieved. It was, therefore, decided to
assess the potential of ideal synthesis in a
preliminary exercise.

In ideal synthesis the low-frequency and 
high-frequency components synthesised at the 
receiving terminal would be, in effect, those which 
would have been present in a wide-band signal, 
generated by a high-quality microphone and 
carried on wide-band circuits. Ideal synthesis can 
thus readily be simulated experimentally by 
providing simultaneously a telephone-quality signal 
and a wide-band signal from which low- and
high-frequency components can be extracted
by filters and combined with the telephone signal. 

In preparation for this investigation a special 
two-track speech recording was prepared. Speech, 
picked up by a telephone handset, held normally, 
was recorded on one track, while wide-band 
speech, picked up by a high-quality studio microphone
about 60 cm from the talker's mouth, was
recorded on the second track. Recordings were 
made of eight different voices, using two different 
telephone handsets. For the investigations 
described later in this report the telephone speech 
bandwidth was restricted by bandpass filters of
300 Hz to 3.4 kHz. 

The presence of the telephone handset near 
the mouth of the person speaking must have 
interfered to some extent with the sound field at 
the high-quality microphone; however, no 
obtrusive effect was detected and the quality from 
the wide-band microphone was considered 
perfectly adequate for the purposes proposed. 

The significance of the time difference of 
approximately 2 ms between the microphone 
output signals (due to the difference in distance of 
the microphones from the talker's mouth) was 
studied briefly in preliminary tests. The addition 
of a 2 ms delay circuit in the telephone channel 
was not found to have any perceptible effect 
in a simple synthesis experiment so was not 
included in the later, more formal, tests. 
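
As a rough check on the quoted figure, the delay can be estimated from the path difference. This assumes a speed of sound of about 343 m/s and takes the path difference as nearly the full 0.6 m, on the assumption that the handset microphone sits within a few centimetres of the mouth; neither value is stated in the Report.

$$\Delta t \approx \frac{\Delta d}{c} \approx \frac{0.6\ \text{m}}{343\ \text{m/s}} \approx 1.7\ \text{ms}$$

This is consistent with the approximate figure of 2 ms given above.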

The various speakers held the telephone 
handset in different, individual, positions, so the 
relative levels of the telephone and wide-band 
signals were different for each speaker. For the 
assessment of simulated ideal synthesis the relative 
levels of the telephone-quality and high-quality 
signals being combined were optimised empirically 
for each voice. 

2.2. Subjective evaluation of the simulation 
of ideal synthesis 

The potential benefit of ideal synthesis of

(a) low-frequency speech components, and
(b) low-frequency and high-frequency components

was assessed subjectively. A panel of listeners was
asked to compare presentations in random order of
"enhanced" telephone speech and "direct"
telephone speech, and to grade their preferences on
the 7-point comparison scale shown in Table I.



TABLE I
7-point comparison scale

Grade   Meaning
 +3     B much better than A
 +2     B better than A
 +1     B slightly better than A
  0     B same as A
 -1     B slightly worse than A
 -2     B worse than A
 -3     B much worse than A



In all, eight pairs of test passages were compared 
and graded by eight non-technical listeners. The 
tests were carried out in a listening room having a 
volume of about 85 m³ and a mean mid-band
reverberation time of some 0.3 s. The tests were
carried out with the excerpts reproduced on a
high-quality loudspeaker, both direct and also
via a network simulating the response of an average
domestic m.f. a.m. receiver^.

Results of the tests are given in Table II. 

TABLE II
Simulation of Ideal Synthesis

Grading of 'enhanced' telephone speech in comparison with direct
telephone speech. Overall mean score (Standard Error of Mean)

Synthesis included   Wide band       Via a.m. receiver simulator
l.f. only            +1.58 (0.09)    +1.24 (0.09)
l.f. plus h.f.       +2.04 (0.01)    +1.9 (0.09)



These results indicate a marked preference
on the part of the listeners for the 'enhanced'
telephone speech when compared with 'ordinary'
telephone speech. The preference is greater when
h.f. components are added as well as l.f.
components and is largely maintained for both
wide-band and receiver-simulated reproduction.

On the basis of these results it was decided 
to pursue the investigation and to study the 
potential of practical synthesis of the 
low-frequency and high-frequency speech 
components excluded by the telephone system. 

3. Practical synthesis of missing speech 
components 

3.1. General 

The speech synthesis simulation process 
described in the previous Section represents an 
ideal situation in which it is possible to recreate 
precisely the low- and high-frequency speech 
components excluded by the telephone system. 
In practice, of course, this ideal situation does not 
apply - the missing components have to be 
synthesised on the basis of information contained 
within the telephone signal as received. The low- 
frequency speech synthesis process studied here 
was based on an estimation of the pitch of the 
talker's voice, as described below. 

Voiced speech originates at the larynx. It
consists typically of a pitch component (a fundamental)
and a series of harmonics, the whole being
shaped by the characteristics of the vocal tract to
generate the formants. The pitch frequency can be
as low as, say, 50 Hz for a male speaker with a
bass voice, so the 300 Hz high-pass restriction of
the telephone system can exclude the fundamental
and up to, say, the 5th harmonic.
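
The arithmetic behind this statement can be sketched as follows. The function name `excluded_harmonics` is hypothetical, and the telephone high-pass characteristic is idealised as a sharp cut-off at 300 Hz, which a real circuit would only approximate.

```python
def excluded_harmonics(pitch_hz, cutoff_hz=300.0):
    """Harmonic numbers (1 = fundamental) lying below an idealised high-pass cut-off."""
    harmonics = []
    n = 1
    while n * pitch_hz < cutoff_hz:
        harmonics.append(n)
        n += 1
    return harmonics

# A 50 Hz bass voice loses the components at 50, 100, 150, 200 and 250 Hz.
print(excluded_harmonics(50.0))   # [1, 2, 3, 4, 5]
# A 120 Hz voice loses only the fundamental and the second harmonic.
print(excluded_harmonics(120.0))  # [1, 2]
```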


The approach to low-frequency synthesis 
adopted was to estimate the pitch, to generate a 
series of harmonics up to 300 Hz, based on this 
fundamental, and to add this fundamental plus 
harmonics - modulated in amplitude by a signal 
derived from part of the available telephone 
signal - to the telephone signal. It was not 
proposed to estimate the formant frequencies or 
to try to imitate the effects of formants on the 
synthesised components. 

Pitch estimation is far from being a trivial 
task, particularly when the signal on which the 
estimate is to be based does not, in fact, contain 
the fundamental component. However, a study of 
the literature dealing with pitch extraction for 
vocoding, voice recognition etc. gave cause for 
some optimism and it was decided to proceed with 
the next stage of the investigation - an assessment 
of the potential benefit of practical synthesis. 
However, in view of the likely complexity of a 
pitch extraction process working on actual 
telephone speech signals, it was decided, as a first 
step, to evaluate a practical synthesis process using 
pitch pulses derived from the simultaneous 
wide-band speech recording mentioned previously. 

3.2. Low-frequency synthesis based on 
pitch-frequency estimation from 
simultaneous wide-band speech signal 

The arrangement used for pitch estimation 
from wide-band speech is shown in Fig. 1. 

Fig. 1 - Pitch extractor/larynx waveform generator (working on wide-band speech).

The incoming wide-band signal is applied via
a 60 Hz high-pass filter to three low-pass filters
arranged to have cut-off frequencies at somewhat
less than octave ratios. The outputs of these filters
are monitored by three threshold detectors which
control an output analogue switch via a logic
circuit.

If any signal is detected in the bottom band
(60 Hz to 90 Hz) its frequency is taken to
represent the pitch: since the bandwidth is less than
an octave, no other speech component will be
present in the band. The logic and switching
units are arranged so that, in this situation, the
output of the 90 Hz low-pass filter is passed, at
suitable level, to a pulse generator giving output
pulses of about 1.6 ms duration (about half the
period of a 300 Hz signal) at the 'pitch' period.

If no pitch component is detected below 
90 Hz the bands up to 160 Hz and up to 280 Hz 
are examined in turn, and used in that order, so 
that the pulse generator is always driven at the 
lowest component frequency detected in the 
wide-band speech. 
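
The selection logic described above can be sketched in a few lines. The filter cut-offs are those given in the text; the function name `select_band`, the representation of each detector input as a single level, and the threshold value are assumptions made for illustration, since the Report does not give the detector settings.

```python
# Low-pass cut-off frequencies from the text, lowest first. A 60 Hz
# high-pass filter precedes all three, so the bottom band is 60-90 Hz.
BANDS_HZ = (90.0, 160.0, 280.0)

# Pulse duration: about half the period of a 300 Hz signal, in milliseconds.
PULSE_MS = 1000.0 / (2 * 300.0)

def select_band(levels, threshold):
    """Return the cut-off of the lowest band whose level exceeds the threshold.

    levels: signal level at the output of each low-pass filter, in the
    same order as BANDS_HZ. Returns None if no band contains a signal.
    """
    for cutoff, level in zip(BANDS_HZ, levels):
        if level > threshold:
            return cutoff
    return None

# Nothing below 90 Hz, but a component in the band up to 160 Hz:
print(select_band((0.0, 0.5, 0.9), threshold=0.1))  # 160.0
print(round(PULSE_MS, 2))                           # 1.67
```

The output of the selected filter then drives the pulse generator, so the pulse rate follows the lowest component found.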

The pitch estimation arrangement, shown in 
Fig. 1, yields at point 'p' a "Larynx waveform" 
comprising rectangular pulses of some 1.6 ms 
duration at a rate determined by the pitch 
frequency estimated from the wide band signal. 
Operating levels were optimised with the pitch
estimator working in the synthesiser circuit shown 
in Fig. 2. 

In operation the input levels of the 
telephone speech and of the wideband speech were 
adjusted to peak consistently to given levels at the 
synthesiser input. Thereafter the operation of the 
low-frequency synthesiser was such as to add 
to the telephone-speech signal the output of 
the pitch extractor/larynx waveform generator 
(Fig. 1), modulated by the components in the 
300 Hz to 1 kHz band of the available speech, 
full-wave rectified and smoothed by a C.R. 
circuit with about 12 ms time constant. The 
level of the synthesised low-frequency 
components added to the telephone signal is 
thus, in effect, "tied" to the level of the 
components in the adjacent part of the 
telephone speech signal. 
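
A digital sketch of this envelope-controlled modulation is given below. The apparatus described was analogue; here the C.R. smoothing is approximated by a one-pole recursive filter, the 300 Hz to 1 kHz band-limiting is assumed to have been applied already, and the sampling rate, test signals and function names are illustrative assumptions.

```python
import math

def envelope(signal, fs_hz, tau_s=0.012):
    """Full-wave rectify, then smooth with a one-pole filter of time constant tau_s."""
    alpha = 1.0 - math.exp(-1.0 / (fs_hz * tau_s))
    env, y = [], 0.0
    for x in signal:
        y += alpha * (abs(x) - y)  # discrete approximation of a C.R. smoother
        env.append(y)
    return env

def modulate(pulses, env):
    """Scale the larynx pulse train by the envelope, sample by sample."""
    return [p * e for p, e in zip(pulses, env)]

FS = 8000  # assumed sampling rate, Hz
# Stand-in for the 300 Hz - 1 kHz band of the telephone speech:
# 0.2 s of a 500 Hz tone followed by 0.2 s of silence.
band = [math.sin(2 * math.pi * 500 * i / FS) for i in range(1600)] + [0.0] * 1600
# Rectangular pulses of about 1.6 ms (13 samples) at a 100 Hz pitch.
pulses = [1.0 if i % 80 < 13 else 0.0 for i in range(len(band))]

env = envelope(band, FS)
lf = modulate(pulses, env)
print(f"envelope during tone: {env[1599]:.3f}")   # settles near 2/pi
print(f"envelope after silence: {env[-1]:.6f}")   # has decayed towards zero
```

Because the synthesised pulses are scaled by this envelope, they fade out when the adjacent telephone-band components do.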

In practice, it was found that the 
relative gains in the 'telephone speech' and 
'synthesised speech' paths, once optimised 
empirically on one voice, remained substantially 
optimum for other voices, and they were not 
readjusted individually in the course of the 
subjective assessments described below. 



To provide a parallel for the 'ideal 
synthesis' tests described earlier, in which 
high-frequency speech components were 
"synthesised", a rudimentary form of 'real' 
high-frequency synthesis was provided to be 
used optionally, in addition to low-frequency 
synthesis. The high-frequency synthesiser, 
shown dotted in Fig. 2, comprised simply a 
white noise generator with a 3.5 kHz to 7 kHz 
band restricted output, modulated by the 3 kHz 
to 3.4 kHz components in the telephone signal 
full-wave rectified and smoothed with a C.R. 
circuit of about 0.4 ms time constant. The 
output of this modulator is combined with the 
available telephone signal and with the 
synthesised low-frequency components to yield 
the 'enhanced' telephone speech. 

3.3. Subjective evaluation of practical 
synthesis 

The potential benefit of real synthesis was
assessed using the apparatus described in the
previous Section and the subjective test procedure 
used earlier for ideal synthesis and described in 
Section 2.2. The same test passages were used, 
and tests were again carried out for both wide-band 
and for simulated a.m. receiver conditions. Ten 
listeners were available to take part in these tests. 

The results of the subjective tests are given 
in Table III. 

TABLE III
Simulation of Real Synthesis

Grading of 'enhanced' telephone speech in comparison with direct
telephone speech. Overall mean scores (Standard Error of Mean)

Synthesis included   Wide band       Via a.m. receiver simulator
l.f. only            +0.82 (0.01)    +0.78 (0.08)
l.f. plus h.f.       +0.61 (0.11)    +1.17 (0.09)


These results are less dramatic than those 
for "Ideal Synthesis" as given in Table II, but are 
none the less favourable and encouraging. In only 
one case was the mean score for a particular 
voice/microphone combination negative, and that 
only minimally. Elsewhere the mean score for 
such voice/microphone combinations ranged up to 
+1.8. 

Fig. 2 - Experimental low-frequency/high-frequency speech synthesiser using pitch extraction from
simultaneous wide-band speech.

The results listed in Table III could
undoubtedly be improved by development of the
techniques used. For example, it is probable that 
the simple 'rectangular' larynx waveform is not 
optimum and that further work could lead to 
improved performance. Further, the high-frequency 
synthesis technique used was very basic, and was 
subjected to virtually no development; an obvious 
step to better performance here would be to 
introduce "go/no-go" gating based on voiced/ 
unvoiced speech decisions. 

However, in view of the encouraging results 
obtained with the existing apparatus it was decided 
not to devote effort to such optimisation at this 
stage but rather to face the major hurdle of pitch
extraction from the available telephone speech
signal, an essential step towards an operational
system.

4. Pitch extraction 

4.1. Theory 

Voiced speech is generated by the passage 
of a series of air pulses, originating at the glottis, 
through the vocal tract - in effect applying a 
periodic waveform to a linear system with a 
number of resonances. 

The application of linear prediction to 
determining the pitch, or fundamental frequency, 
of voiced speech is based on a particular model of 
this physical arrangement; the glottal waveform
is treated as a series of impulses and the vocal
tract is modelled as an all-pole filter. Implicit in 
this model is the assumption that the vocal tract 
can incorporate the effects of the actual shape of 
the glottal pulse. However, in some analyses the 
predictor is preceded by differentiation (single or 
double) which "sharpens" the pulse shape and, by 
making it more like the impulse assumed in the 
model, improves the fit of the modelling filter. 

Linear prediction is a technique whereby 
each sample of a digitally encoded signal is 
'optimally' predicted in terms of previous samples - 
optimally here meaning that the mean square 
prediction error is minimum. A property of such a 
predictor is that successive values of the prediction
error are uncorrelated (within the span of the 
predictor); this means that the coarse detail of the 
prediction error spectrum is flat, i.e. the prediction 
error resembles 'white noise'. In addition, since 
the prediction time span is far less than the pitch 
pulse repetition period, the prediction error will 
possess a fine 'comb' spectrum, the spacing of 
which corresponds to the pitch frequency. 
Provided the predictor span is large enough to 
accommodate the detail of the vocal tract and 
the shape of the glottal pulse (assuming an all-pole 
model), the resulting error may then be taken to be 
the input to the model of the vocal tract, i.e. the 
impulse train, or pitch pulses. Thus, the excitation 
at the fundamental frequency of the voiced sound 
can be extracted using linear prediction 
techniques. 
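
The idea that the prediction error recovers the excitation can be illustrated with an idealised sketch. It assumes the all-pole model holds exactly and that the predictor coefficients are already known; in the Report they are estimated block by block from the speech itself. The coefficient values, the 40-sample pitch period and the function names are arbitrary illustrative choices.

```python
def all_pole_filter(excitation, coeffs):
    """Model 'vocal tract': s[n] = e[n] + sum_k coeffs[k-1] * s[n-k]."""
    out = []
    for n, e in enumerate(excitation):
        acc = e
        for k, a in enumerate(coeffs, start=1):
            if n - k >= 0:
                acc += a * out[n - k]
        out.append(acc)
    return out

def prediction_residual(signal, coeffs):
    """Prediction error: each sample minus its prediction from previous samples."""
    res = []
    for n, x in enumerate(signal):
        pred = sum(a * signal[n - k]
                   for k, a in enumerate(coeffs, start=1) if n - k >= 0)
        res.append(x - pred)
    return res

PERIOD = 40                      # 100 Hz pitch at an assumed 4 kHz sampling rate
excitation = [1.0 if n % PERIOD == 0 else 0.0 for n in range(200)]
coeffs = [1.3, -0.8]             # a stable two-pole resonance (illustrative values)

speech_like = all_pole_filter(excitation, coeffs)
residual = prediction_residual(speech_like, coeffs)

# The residual is the original impulse train, so the pitch pulses reappear.
pulse_positions = [n for n, r in enumerate(residual) if r > 0.5]
print(pulse_positions)  # [0, 40, 80, 120, 160]
```

With real speech the model fits only approximately, so the residual contains noise and secondary pulses as well as the pitch pulses.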






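In vector notation the optimal Nth-order
predictor, whose coefficients form the vector a,
is given by

$$\mathbf{a} = \mathbf{R}^{-1}\,\mathbf{r}$$

where R is an N×N correlation matrix with (i,j)th
term

$$R_{ij} = \langle x_{n-i},\, x_{n-j} \rangle$$

and r is a correlation vector with ith term

$$r_i = \langle x_n,\, x_{n-i} \rangle$$

Here $x_n$ denotes the nth signal sample. The
brackets <,> denote averaging and
can be taken as a summation over the time interval
of interest.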

The matrix R is of a special form (Toeplitz); 
because of this and because of the nature of the 
correlation vector r the prediction vector a can be 
evaluated using a variety of algorithms such as 
Durbin's procedure''. These involve solving for a in 
a recursive manner by evaluating the optimum 1st-
order (single sample) predictor, then the 2nd and
so on; the Nth-order predictor is evaluated in terms
of the (N-1)th. An attractive feature of the
technique is the fact that the residual error power 
is evaluated at each pass, and the iteration can be 
controlled if desired by the value of the residual 
error. 

The details of the algorithm are given in 
the Appendix. 
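
For orientation, the recursion can be sketched as below. This is the textbook Levinson-Durbin form, written under the convention that the predicted sample is the sum of a_k times the kth previous sample; it is not taken from the Appendix, and the function name `durbin` and the list layout are illustrative assumptions.

```python
def durbin(r, order):
    """Solve for predictor coefficients from autocorrelation values r[0..order].

    Returns (a, errors): a[k-1] is the coefficient applied to the sample
    k steps back, and errors[m] is the residual error power after the
    mth-order predictor has been found (errors[0] is the signal power).
    """
    a = []
    err = r[0]
    errors = [err]
    for m in range(1, order + 1):
        # Part of r[m] not already explained by the (m-1)th-order predictor.
        acc = r[m] - sum(a[k] * r[m - 1 - k] for k in range(m - 1))
        k_m = acc / err                      # one division per pass
        # Build the mth-order predictor from the (m-1)th-order one.
        a = [a[k] - k_m * a[m - 2 - k] for k in range(m - 1)] + [k_m]
        err *= (1.0 - k_m * k_m)             # residual error power for this pass
        errors.append(err)
    return a, errors

# Autocorrelation of a first-order process with correlation 0.5 per sample:
# a second-order predictor should need only the first coefficient.
a, errors = durbin([1.0, 0.5, 0.25], order=2)
print(a)       # [0.5, 0.0]
print(errors)  # [1.0, 0.75, 0.75]
```

The `errors` list shows the feature noted above: the residual error power is available after every pass, so the iteration can be stopped once it is small enough.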

In calculating the predictor (and hence the 
residual prediction error) on a sample-by-sample 
basis, it is necessary continually to update the predictor
to take into account the changing nature of
the signal. The speech waveform is therefore 
divided into segments where the segment length 
must be long enough to build up statistical 
reliability and to enable the low-frequency 
structure of the signal to be extracted, and must 
be short enough to enable the change in signal 
statistics with time to be followed or 'tracked'. 

To avoid truncation effects the blocks 
should be 'windowed'; a raised-cosine window 
(such as a Hamming window) is most useful, and 
it is advisable to have an update time of less than 
the block length (in other words successive blocks 
should overlap). 
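
The segmentation and windowing can be sketched as follows. The block length of 80 samples and update of 32 samples are example values only (20 ms and 8 ms at an assumed 4 kHz sampling rate), and the function names are hypothetical.

```python
import math

def hamming(n):
    """Hamming window: a raised cosine on a small pedestal."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def windowed_blocks(signal, block_len, update):
    """Split the signal into overlapping blocks and apply the window to each.

    A new block starts every `update` samples; with update < block_len
    successive blocks overlap.
    """
    w = hamming(block_len)
    blocks = []
    start = 0
    while start + block_len <= len(signal):
        segment = signal[start:start + block_len]
        blocks.append([s * wi for s, wi in zip(segment, w)])
        start += update
    return blocks

# 400 samples of a constant signal make the window shape easy to see.
blocks = windowed_blocks([1.0] * 400, block_len=80, update=32)
print(len(blocks))                 # 11 overlapping blocks
print(round(blocks[0][0], 2))      # 0.08 at the block edge
print(round(max(blocks[0]), 3))    # close to 1 near the block centre
```

Each windowed block would then be used to compute one set of correlation values and hence one predictor.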

4.2. Software simulation 

4.2.1. General 

A simulation of the pitch synthesis
algorithm was carried out on the Department's
PDP11 computer and also on COPAS^. The
PDP11 simulation was not 'real-time' in the sense
that the processing time was orders of magnitude
larger than the actual duration of the audio passage,
but the data were 'real' audio digital signals transferred
onto disc files; COPAS allows real-time
processing but lacks some of the numerical
accuracy and programming flexibility of the
PDP11. However, as we report below, the two
systems gave similar results.

The pitch detector algorithms used as input 
data two types of programme material: 

1. High quality speech recorded using a 
studio-quality ribbon microphone, and bandlimited 
to 300 Hz - 1.7 kHz. 

2. Real telephone speech, band-limited 
to 300 Hz - 1.7 kHz.

Also available were high quality speech 
extracts with the full gamut of low frequencies 
(i.e. below 300 Hz). 

Several voices of each type were assessed. 
In addition several 'off-air' phone-in extracts were 
available. 

The simulation included the following steps. 

(i) Differentiate the data (optional)
(ii) Divide the signal into blocks of data
(iii) Window the data blocks
(iv) Calculate the autocorrelation function corresponding
to each block
(v) Calculate the optimum predictor for that block
(vi) Using the predictor, calculate the residual
prediction error signal

The block size could exceed the update 
time (i.e. the time between successive blocks); 
in other words, successive blocks could overlap. 

A Hamming window was used in both 
simulations. 

4.2.2. PDP11 simulation

The audio extracts were transferred from
audio tape through an a.d.c. onto computer disc
files using a direct memory access interface board
(the DRSllE) associated with the PDP11. The
audio signals were initially sampled at 8 kHz, after
low-pass filtering to 3.4 kHz. However, to reduce
the total processing load, the signals were low-pass
filtered to 1.5 kHz and resampled at 4 kHz by the
computer.

The program user selected a particular 
audio item (from a disc file) and specified input 
parameters including block size, update time and 
predictor order. It was also possible to differentiate
(singly or doubly) the input audio signal.
Varying the parameter values tended not to have
a great influence on the results and the values were
near the following: block size 20 ms, update
time 8 ms and predictor order 10.

The residual prediction error could be 
plotted and inspected; it was also combined with 
the original audio signals and read out from the 
computer (to a two-track tape recorder) for 
'real-time' processing (see Section 4.3).

4.2.3. COPAS realisation 

COPAS'' is a high speed, programmable, 
real-time, audio processor employing a fast bit-slice
alu (arithmetic logic unit) and multiplier/accumulator.
At 32 kHz sampling rate, 192
microinstructions may be executed between
sampling pulses. For pitch extraction, however, the
bandwidth of the input signal was limited to about
1.7 kHz and the sampling rate was reduced to
4 kHz. This increased by 8 times the number of
microinstructions between samples and greatly
extended the programming power.
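
Using only the figures quoted above, the instruction budget per sampling period at the reduced rate works out as

$$192 \times \frac{32\ \text{kHz}}{4\ \text{kHz}} = 192 \times 8 = 1536\ \text{microinstructions}$$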

Before the full LPC algorithm of Section 
4.1. was implemented, some much simpler 
algorithms^ were tried. However, these were 
discounted at an early stage since they showed 
little promise for the present application. 

The approach adopted for realising the 
full LPC algorithm on COPAS was the same in 
principle as that implemented on the PDP11 and
Durbin's algorithm was again invoked. This 
algorithm is relatively complex and presented 
certain obstacles in implementing it on COPAS 
as follows: 

a) Speed 

With such a long program, it was important 
to write it as economically as possible to achieve 
the speed needed for real-time operation whilst
still fulfilling the processing requirements.

b) Fixed-point operation 

Unlike the PDP11 program, COPAS uses
16-bit, fixed-point arithmetic. Some of the
processes of the program involve small numbers
and, at all stages, precautionary measures (such as
scaling) had to be taken to ensure the greatest
accuracy possible.

c) Data memory addressing 

Two programs effectively run concurrently 
for real-time operation, each one accessing 
different blocks of data. A complex addressing 
scheme for the data memory had to be devised to 
read and write data for the various program stages. 
This was perhaps the most complex part of the 
program. 

The two programs are as follows. A background
program, which takes several audio samples
(about 10) to complete, computes the prediction 
coefficients from a particular block of data. The 
block is windowed, the autocorrelation function 
is computed and the prediction coefficients are 
derived. The second program which runs every 
sampling period estimates the residual error. The 
two programs are made to run concurrently by 
time-multiplexing. 

The derivation of the prediction coefficients 
is a recursive process as described in Section 4.1. 
The residual error power is found every pass and 
tested against a preselected minimum. If it is less, 
the prediction order is reduced to the value 
computed in the last pass and subsequent 
calculations work to this new, reduced value. 
This is done to avoid the build-up of rounding 
errors. If this preventive action is not taken, 
the error power can actually increase with the 
prediction order. Each pass requires a division 
and a special subroutine was written to execute 
16-bit, fixed-point division, accurately and quickly. 
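
The Report does not describe the division subroutine. As a hedged illustration of what fixed-point division involves, the sketch below uses the standard shift-and-subtract (restoring) method to form a 15-bit fractional quotient; the function name `q15_divide`, the Q15 scaling and the restriction to a positive numerator smaller than the denominator are all assumptions, not details of the COPAS program.

```python
def q15_divide(num, den):
    """Return floor(num * 2**15 / den) using only shifts, compares and subtracts.

    Assumes integers with 0 <= num < den < 2**15, so the true quotient is a
    fraction below 1 and is returned scaled by 2**15 (Q15 format).
    """
    if not (0 <= num < den < (1 << 15)):
        raise ValueError("requires 0 <= num < den < 2**15")
    quotient, remainder = 0, num
    for _ in range(15):          # one quotient bit per pass
        remainder <<= 1
        quotient <<= 1
        if remainder >= den:     # does the denominator fit at this bit position?
            remainder -= den
            quotient |= 1
    return quotient

print(q15_divide(1, 2))        # 16384, i.e. 0.5 in Q15
print(q15_divide(1000, 4000))  # 8192, i.e. 0.25 in Q15
```

In a predictor calculation of the kind described, such a routine could supply the quotient needed in each pass, with the numerator's sign handled separately.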

To run the program the user specified 
the prediction order, the data block size, the 
update time between new prediction 
coefficients, and the minimum error power. It 
was found that the performance of the pitch 
extractor was not very critical of the 
parameters selected and one combination that 
gave good results was as follows: 

Prediction order = 8
Data block size = 160 samples (20 ms)
Update time = 64 samples (8 ms)
Minimum residual error power = 0.06 (normalised)

Differentiating the data prior to weighting 
did not improve the performance and was not 
therefore used. 






4.3. Synthesis of low frequencies from the 
pitch pulses extracted from telephone 
speech 

Although some effort was devoted to
synthesising the low frequencies from the pitch
pulses generated by the PDP11, the main effort
exploited the real-time capability of COPAS. Fig.
3 shows how the low frequency speech was
synthesised from the pitch pulses from COPAS and
how they were added to the mid-band frequencies
for subjective appraisal. The arrangement is similar
to that described in Section 3.2.
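The general shape of such a synthesiser can be sketched in a few lines. This is a loose illustration under stated assumptions and not a model of the actual hardware: the sampling rate, the one-pole filters, the 30 Hz envelope smoothing and the function names are all hypothetical choices, and the pitch marks are assumed to arrive already late by the processing delay of the pitch extractor, which is why the mid band is delayed before the two are added.

```python
import numpy as np

FS = 8000  # Hz, assumed sampling rate

def one_pole_lowpass(x, cutoff_hz, fs=FS):
    """Simple first-order low-pass filter (stand-in for the real filters)."""
    alpha = 1.0 - np.exp(-2.0 * np.pi * cutoff_hz / fs)
    y = np.zeros(len(x))
    state = 0.0
    for n, v in enumerate(x):
        state += alpha * (v - state)
        y[n] = state
    return y

def synthesise_lf(pitch_marks, speech, fs=FS, delay_ms=50.0):
    """Add pitch-triggered low-frequency content to delayed mid-band speech."""
    speech = np.asarray(speech, dtype=float)
    n = len(speech)
    d = int(round(delay_ms * 1e-3 * fs))
    delayed = np.concatenate([np.zeros(d), speech])[:n]   # mid band, delayed
    pulses = np.zeros(n)
    pulses[np.asarray(pitch_marks, dtype=int)] = 1.0      # pulse per pitch mark
    lf = one_pole_lowpass(pulses, 300.0, fs)              # keep roughly < 300 Hz
    env = one_pole_lowpass(np.abs(delayed), 30.0, fs)     # envelope detector
    return delayed + lf * env                             # modulate and add

out = synthesise_lf([], np.ones(1000))   # no pitch marks: delayed speech only
```

With no pitch marks the output is simply the delayed input, which makes the role of the delay easy to check in isolation.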



Fig. 3 - Experimental low-frequency speech synthesiser using COPAS pitch extraction.
(Block diagram. Labelled blocks: limiter, ADC, COPAS pitch extractor, DAC, limiter, pitch-triggered pulse generator, envelope detector, modulator and 50 ms delay, with filters marked 1.7 kHz, 1 kHz, 300 Hz and 300 Hz/3.4 kHz. Input: telephone speech. Output: 'enhanced' telephone speech.)

Two important features are the limiters
before and after COPAS and the delay of the
mid-frequencies. The limiters increased the
effective dynamic range of the pitch extractor
and the l.f. synthesiser and they were set to
give about 20 dB limiting. The delay was found
to be relatively critical and 50 msec gave a
reasonable match between the l.f. and m.f. bands.

4.4. Results and discussion

Both the PDP11 and COPAS processes
produced distinctive pitch pulses for a limited
range of test items. Only high quality (Type 1)
speech gave clearly recognisable pitch pulses
although they often suffered from secondary,
smaller pulses between the principal pulses. It
was expected that these might cause difficulties
when synthesising the low frequencies (see Section
4.3). Type 2 speech (telephone) gave poor results
and the pitch pulses were barely distinguishable
from the secondary pulses.

The similarity of the pitch pulses from the
two machines does lead to the redeeming
conclusion that the two programs on the PDP11
and on COPAS were performing as intended and
that the precautions taken to guard against
rounding errors on the fixed-point machine were
effective.



Overall, the results from the completed
system were variable and depended greatly on the
speaker. As expected, the high quality (Type 1)
speech fared best. Generally, baritone and bass
voices gave relatively clean bass frequencies and,
when these were added to the m.f. band, the
result was slightly improved. Tenor and female
voices, on the other hand, hardly benefited at
all from the addition of the synthesised low
frequencies.

The main problem appeared to be that
the secondary pulses between the
pitch-triggered pulses were sometimes of
sufficient amplitude to be detected by the
pitch-pulse generator. This resulted in spurious
operation of the synthesiser and gave the
sound an unpleasant, rasping quality.
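The mechanism can be illustrated with a toy example. The sketch below assumes a simple fixed-threshold trigger, which is a stand-in and not a description of the actual pulse generator, and the pulse amplitudes and spacing are invented for the illustration.

```python
import numpy as np

def trigger_indices(signal, threshold):
    """Sample indices at which the signal rises through the threshold."""
    above = np.asarray(signal) >= threshold
    return np.flatnonzero(above[1:] & ~above[:-1]) + 1

# Invented test signal: principal pulses every 80 samples (100 Hz at an
# assumed 8 kHz sampling rate) with smaller secondary pulses in between.
x = np.zeros(800)
x[40::80] = 1.0     # principal pitch pulses
x[80::80] = 0.6     # secondary pulses

clean = trigger_indices(x, 0.7)      # threshold above the secondary pulses
spurious = trigger_indices(x, 0.5)   # threshold below them
```

With the higher threshold only the ten principal pulses trigger; with the lower one the nine secondary pulses trigger as well, nearly doubling the apparent pitch. When pulse amplitudes vary from one pitch period to the next, no single threshold separates the two reliably.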






The effect was much worse when the source 
was real telephone speech. The synthesised bass 
was very unpleasant to listen to and actually 
impaired the sound quality when added to the 
m.f. band. 

It is not clear why the pitch pulses were 
so poor. The most likely reason seems to be 
that the model of the speech on which the 
pitch extraction was based did not always hold 
good. Unvoiced sounds for example do not 
conform to the simple all-pole model of the 
vocal tract. More seriously, the carbon
microphones of most telephone handsets cause
non-linear distortion and the all-pole model 
again breaks down. This could account for 
the greatly inferior performance of the pitch 
extractor when treating real telephone speech. 
However, even for high quality, bandlimited 
speech, the pitch pulses were often of poor 
quality particularly for high pitched voices. 
Only for deep voices were the pitch pulses 
normally of good quality. 

5. Conclusions 

Telephone speech is typically band-restricted 
to about 300 Hz to 3.4 kHz and, as a 
result, compares unfavourably with studio 
speech when included in a broadcast 
programme. This report has considered the 
practicability of improving the quality of 
broadcast telephone speech by synthesising 
the speech components excluded by the 
telephone system. 

Subjective tests indicated that the addition
to recorded telephone speech of
low-frequency and high-frequency
components taken from a simultaneous
wide-band recording - i.e. a simulation of
"ideal" synthesis - gives a marked 
improvement ('better than' on the CCIR 
7-point comparison scale). The investigation 
was therefore extended to study the 
potential of practical synthesis - particularly 
of a synthesis of low-frequency components 
achieved by adding a synthesised glottis 
waveform based on an estimate of the 
pitch, or fundamental component, of voiced 
speech. The success of the approach hinges 
on the ability to estimate the pitch from 
the band of speech available; recreating the 
harmonics is a relatively straightforward 
matter. 

As an initial trial, the pitch was derived from
simultaneously recorded wide-band speech. The
harmonics were generated and the whole
modulated and added to the band-limited speech.
The low frequency components synthesised in this 
way did not sound as natural as the ideal synthesis. 
Nevertheless, the result was still judged to be an 
improvement. 

Extracting the pitch from the band-limited 
speech proved to be a formidable task. The 
linear prediction techniques were successful for 
high quality (i.e. undistorted), low-pitched 
voices; tenor and female voices usually failed 
to give convincing results. Once the speech 
was distorted, as it normally is by the carbon 
microphone of the telephone handset, the 
model of speech upon which the pitch 
extraction algorithms were based broke down. 
The synthesised low frequencies were then 
totally unacceptable and actually impaired the 
quality of the original telephone speech. 

The exact reasons for the failure to extract 
the pitch are unclear. Linear prediction 
seems ineffective at dealing with distorted 
speech and even when the speech was 
undistorted, it was not always successful. 
Other pitch extraction methods exist but either
they are less effective than linear prediction or
they are so complex that carrying out the
requisite processing in real time is probably
beyond the means of current technology. As
it was, implementing the linear prediction 
algorithms stretched the considerable processing 
power of COPAS near to its limits. 

One may conclude, therefore, that a 
worthwhile improvement could be achieved in 
the quality of broadcast telephone speech if 
missing low-frequency components, excluded 
by the telephone system, could be synthesised 
satisfactorily. However, under practical
conditions, the problems of extracting pitch
information with the accuracy needed for
low-frequency synthesis are formidable. It is
nevertheless possible that these
problems will be solved as a result of effort
stimulated by much larger areas of potential
application in fields outside broadcasting.

Little effort could be devoted to h.f.
synthesis. One rudimentary synthesis
technique, that of adding modulated,
band-limited noise to form the upper band,
showed some promise, but to develop this,
further investigation would be necessary into
more sophisticated techniques invoking, for
example, control based on voiced/unvoiced
decisions.






6. References

1. CROLL, M.G. Sound-quality improvement of broadcasting telephone calls. BBC Research Department Report No. 1972/26.

2. HACKING, K. and ROACH, R.J.J. Frequency response characteristics of four a.m. receivers. BBC Research Department Report No. 1969/34.

3. MAKHOUL, J. Linear prediction - a tutorial review. Proc. IEEE, 63, 4, April 1975.

4. MCNALLY, G.W. COPAS - a high quality real-time digital audio processor. BBC Research Department Report No. RD 1979/26.

5. MAKSYM, J.N. Real-time pitch extraction by adaptive prediction of the speech waveform. IEEE Transactions on Audio and Electroacoustics, AU-21, 3, June 1973.






APPENDIX : CALCULATION OF PREDICTOR COEFFICIENTS 

Using the notation $r_j = \langle x_t x_{t+j} \rangle$ for the autocorrelation coefficients (where the brackets denote
averaging), the optimum $N$th order predictor coefficients $(a_1, a_2, \ldots, a_N)$ are worked out by first evaluating
the optimum first order predictor, then the second order predictor in terms of the first, and so on.

First set $E_0 = r_0$.

Then for $i = 1$ to $N$, evaluate

$$k_i = -\left[ r_i + \sum_{j=1}^{i-1} a_j^{(i-1)} r_{i-j} \right] / E_{i-1}$$

$$a_i^{(i)} = k_i$$

$$a_j^{(i)} = a_j^{(i-1)} + k_i \, a_{i-j}^{(i-1)} \quad \text{for } j = 1, 2, \ldots, i-1$$

$$E_i = (1 - k_i^2) \, E_{i-1}$$

where $a_j^{(i)}$ is the $j$th coefficient of the $i$th order predictor.

The final $N$th order predictor coefficients are
$a_j^{(N)}$ for $j = 1, 2, \ldots, N$.
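As an illustration, the recursion can be worked by hand for $N = 2$ with the invented values $r_0 = 1$, $r_1 = 0.5$, $r_2 = 0.25$. The example assumes the usual form of the recursion, in which the residual is $e_t = x_t + \sum_j a_j x_{t-j}$ and the coefficient found at pass $i$ is $k_i = -[r_i + \sum_{j=1}^{i-1} a_j^{(i-1)} r_{i-j}] / E_{i-1}$ with $a_i^{(i)} = k_i$; these are assumptions about the sign convention rather than statements taken from the text.

```latex
\begin{align*}
E_0 &= r_0 = 1 \\
k_1 &= -\frac{r_1}{E_0} = -0.5, \qquad a_1^{(1)} = k_1 = -0.5 \\
E_1 &= (1 - k_1^2)\,E_0 = 0.75 \\
k_2 &= -\frac{r_2 + a_1^{(1)} r_1}{E_1} = -\frac{0.25 - 0.25}{0.75} = 0,
       \qquad a_2^{(2)} = k_2 = 0 \\
a_1^{(2)} &= a_1^{(1)} + k_2\,a_1^{(1)} = -0.5 \\
E_2 &= (1 - k_2^2)\,E_1 = 0.75
\end{align*}
```

The second pass leaves both the coefficients and the error power unchanged, which is the situation in which reducing the prediction order loses nothing.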






